{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f8f6fcbd",
   "metadata": {},
   "source": [
    "# Deploy a medium-sized LLM\n",
    "\n",
    "<div align=\"left\">\n",
    "<a target=\"_blank\" href=\"https://console.anyscale.com/template-preview/deployment-serve-llm?file=%252Ffiles%252Fmedium-size-llm\"><img src=\"https://img.shields.io/badge/🚀 Run_on-Anyscale-9hf\"></a>&nbsp;\n",
    "<a href=\"https://github.com/ray-project/ray/tree/master/doc/source/serve/tutorials/deployment-serve-llm/medium-size-llm\" role=\"button\"><img src=\"https://img.shields.io/static/v1?label=&amp;message=View%20On%20GitHub&amp;color=586069&amp;logo=github&amp;labelColor=2f363d\"></a>&nbsp;\n",
    "</div>\n",
    "\n",
    "This tutorial shows you how to deploy and serve a medium language model in production with Ray Serve LLM. A medium LLM typically runs on a single node with 4-8 GPUs. It offers a balance between performance and efficiency. This tutorial deploys Llama-3.1-70&nbsp;B, a medium-sized LLM with 70&nbsp;B parameters. These models provide stronger accuracy and reasoning than small models while remaining more affordable and resource-friendly than very large ones. This makes them a solid choice for production workloads that need good quality at lower cost.\n",
    "\n",
    "For smaller models, see [Deploy a small-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html). For larger models, see [Deploy a large-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/large-size-llm/README.html).\n",
    "\n",
    "---\n",
    "\n",
    "## Configure Ray Serve LLM\n",
    "\n",
    "You can deploy a medium-sized LLM on a single node with multiple GPUs. To leverage all available GPUs, set `tensor_parallel_size` to the number of GPUs on the node, which distributes the model’s weights evenly across them.\n",
    "\n",
    "Ray Serve LLM provides multiple [Python APIs](https://docs.ray.io/en/latest/serve/api/index.html#llm-api) for defining your application. Use [`build_openai_app`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.build_openai_app.html#ray.serve.llm.build_openai_app) to build a full application from your [`LLMConfig`](https://docs.ray.io/en/latest/serve/api/doc/ray.serve.llm.LLMConfig.html#ray.serve.llm.LLMConfig) object.\n",
    "\n",
    "Set your Hugging Face token in the config file to access gated models like `Llama-3.1`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d185d580",
   "metadata": {},
   "outputs": [],
   "source": [
    "# serve_llama_3_1_70b.py\n",
    "from ray.serve.llm import LLMConfig, build_openai_app\n",
    "import os\n",
    "\n",
    "llm_config = LLMConfig(\n",
    "    model_loading_config=dict(\n",
    "        model_id=\"my-llama-3.1-70b\",\n",
    "        # Or unsloth/Meta-Llama-3.1-70B-Instruct for an ungated model\n",
    "        model_source=\"meta-llama/Llama-3.1-70B-Instruct\",\n",
    "    ),\n",
    "    accelerator_type=\"L40S\", # Or \"A100-40G\"\n",
    "    deployment_config=dict(\n",
    "        autoscaling_config=dict(\n",
    "            min_replicas=1,\n",
    "            max_replicas=4,\n",
    "        )\n",
    "    ),\n",
    "    ### If your model is not gated, you can skip `HF_TOKEN`\n",
    "    # Share your Hugging Face token with the vllm engine so it can access the gated Llama 3.\n",
    "    # Type `export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>` in a terminal\n",
    "    runtime_env=dict(env_vars={\"HF_TOKEN\": os.environ.get(\"HF_TOKEN\")}),\n",
    "    engine_kwargs=dict(\n",
    "        max_model_len=32768,\n",
    "        # Split weights among 8 GPUs in the node\n",
    "        tensor_parallel_size=8,\n",
    "    ),\n",
    ")\n",
    "\n",
    "app = build_openai_app({\"llm_configs\": [llm_config]})\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6b2231a5",
   "metadata": {},
   "source": [
    "**Note:** Before moving to a production setup, migrate to using a [Serve config file](https://docs.ray.io/en/latest/serve/production-guide/config.html) to make your deployment version-controlled, reproducible, and easier to maintain for CI/CD pipelines. See [Serving LLMs - Quickstart Examples: Production Guide](https://docs.ray.io/en/latest/serve/llm/quick-start.html#production-deployment) for an example.\n",
    "\n",
    "---\n",
    "\n",
    "## Deploy locally\n",
    "\n",
    "**Prerequisites**\n",
    "\n",
    "* Access to GPU compute.\n",
    "* (Optional) A **Hugging Face token** if using gated models like Meta’s Llama. Store it in `export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>`.\n",
    "\n",
    "**Note: **Depending on the organization, you can usually request access on the model's Hugging Face page. For example, Meta’s Llama model approval can take anywhere from a few hours to several weeks.\n",
    "\n",
    "**Dependencies:**  \n",
    "```bash\n",
    "pip install \"ray[serve,llm]\"\n",
    "```\n",
    "\n",
    "---\n",
    "\n",
    "### Launch\n",
    "\n",
    "Follow the instructions at [Configure Ray Serve LLM](#configure-ray-serve-llm) to define your app in a Python module `serve_llama_3_1_70b.py`.  \n",
    "\n",
    "In a terminal, run:  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ae9da12c",
   "metadata": {},
   "outputs": [],
   "source": [
    "export HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>\n",
    "serve run serve_llama_3_1_70b:app --non-blocking"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96d18e22",
   "metadata": {},
   "source": [
    "Deployment typically takes a few minutes as the cluster is provisioned, the vLLM server starts, and the model is downloaded. \n",
    "\n",
    "---\n",
    "\n",
    "### Send requests\n",
    "\n",
    "Your endpoint is available locally at `http://localhost:8000` and you can use a placeholder authentication token for the OpenAI client, for example `\"FAKE_KEY\"`.\n",
    "\n",
    "Example curl:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1dd345c",
   "metadata": {},
   "outputs": [],
   "source": [
    "curl -X POST http://localhost:8000/v1/chat/completions \\\n",
    "  -H \"Authorization: Bearer FAKE_KEY\" \\\n",
    "  -H \"Content-Type: application/json\" \\\n",
    "  -d '{ \"model\": \"my-llama-3.1-70b\", \"messages\": [{\"role\": \"user\", \"content\": \"What is 2 + 2?\"}] }'"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dca5e4fd",
   "metadata": {},
   "source": [
    "Example Python:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "584f01f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "#client.py\n",
    "from urllib.parse import urljoin\n",
    "from openai import OpenAI\n",
    "\n",
    "API_KEY = \"FAKE_KEY\"\n",
    "BASE_URL = \"http://localhost:8000\"\n",
    "\n",
    "client = OpenAI(base_url=urljoin(BASE_URL, \"v1\"), api_key=API_KEY)\n",
    "\n",
    "response = client.chat.completions.create(\n",
    "    model=\"my-llama-3.1-70b\",\n",
    "    messages=[{\"role\": \"user\", \"content\": \"Tell me a joke\"}],\n",
    "    stream=True\n",
    ")\n",
    "\n",
    "for chunk in response:\n",
    "    content = chunk.choices[0].delta.content\n",
    "    if content:\n",
    "        print(content, end=\"\", flush=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1a5fd1fb",
   "metadata": {},
   "source": [
    "\n",
    "---\n",
    "\n",
    "### Shutdown\n",
    "\n",
    "Shutdown your LLM service: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1c03cdb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "serve shutdown -y"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dc223463",
   "metadata": {},
   "source": [
    "\n",
    "---\n",
    "\n",
    "## Deploy to production with Anyscale services\n",
    "\n",
    "For production deployment, use Anyscale services to deploy the Ray Serve app to a dedicated cluster without modifying the code. Anyscale ensures scalability, fault tolerance, and load balancing, keeping the service resilient against node failures, high traffic, and rolling updates.\n",
    "\n",
    "---\n",
    "\n",
    "### Launch the service\n",
    "\n",
    "Anyscale provides out-of-the-box images (`anyscale/ray-llm`), which come pre-loaded with Ray Serve LLM, vLLM, and all required GPU/runtime dependencies. This makes it easy to get started without building a custom image.\n",
    "\n",
    "Create your Anyscale service configuration in a new `service.yaml` file:\n",
    "```yaml\n",
    "# service.yaml\n",
    "name: deploy-llama-3-70b\n",
    "image_uri: anyscale/ray-llm:2.49.0-py311-cu128 # Anyscale Ray Serve LLM image. Use `containerfile: ./Dockerfile` to use a custom Dockerfile.\n",
    "compute_config:\n",
    "  auto_select_worker_config: true \n",
    "working_dir: .\n",
    "cloud:\n",
    "applications:\n",
    "  # Point to your app in your Python module\n",
    "  - import_path: serve_llama_3_1_70b:app\n",
    "```\n",
    "\n",
    "Deploy your service. Make sure you forward your Hugging Face token to the command."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa1c6108",
   "metadata": {
    "pygments_lexer": "bash"
   },
   "outputs": [],
   "source": [
    "anyscale service deploy -f service.yaml --env HF_TOKEN=<YOUR-HUGGINGFACE-TOKEN>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "18226fd7",
   "metadata": {},
   "source": [
    "**Custom Dockerfile**  \n",
    "You can customize the container by building your own Dockerfile. In your Anyscale Service config, reference the Dockerfile with `containerfile` (instead of `image_uri`):\n",
    "\n",
    "```yaml\n",
    "# service.yaml\n",
    "# Replace:\n",
    "# image_uri: anyscale/ray-llm:2.49.0-py311-cu128\n",
    "\n",
    "# with:\n",
    "containerfile: ./Dockerfile\n",
    "```\n",
    "\n",
    "See the [Anyscale base images](https://docs.anyscale.com/reference/base-images) for details on what each image includes.\n",
    "\n",
    "---\n",
    "\n",
    "### Send requests \n",
    "\n",
    "The `anyscale service deploy` command output shows both the endpoint and authentication token:\n",
    "```console\n",
    "(anyscale +3.9s) curl -H \"Authorization: Bearer <YOUR-TOKEN>\" <YOUR-ENDPOINT>\n",
    "```\n",
    "You can also retrieve both from the service page in the Anyscale console. Click the **Query** button at the top. See [Send requests](#send-requests) for example requests, but make sure to use the correct endpoint and authentication token.  \n",
    "\n",
    "---\n",
    "\n",
    "### Access the Serve LLM dashboard\n",
    "\n",
    "See [Monitor your deployment](#monitor-your-deployment) for instructions on enabling LLM-specific logging. To open the Ray Serve LLM dashboard from an Anyscale service:\n",
    "1. In the Anyscale console, go to your **Service** or **Workspace**\n",
    "2. Navigate to the **Metrics** tab\n",
    "3. Click **View in Grafana** and click **Serve LLM Dashboard**\n",
    "\n",
    "---\n",
    "\n",
    "### Shutdown \n",
    " \n",
    "Shutdown your Anyscale service:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "211d5baf",
   "metadata": {},
   "outputs": [],
   "source": [
    "anyscale service terminate -n deploy-llama-3-70b"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d8fba49",
   "metadata": {},
   "source": [
    "\n",
    "---\n",
    "\n",
    "## Monitor your deployment\n",
    "\n",
    "Ray Serve LLM provides comprehensive monitoring through the Serve LLM Dashboard. This dashboard visualizes key metrics including:\n",
    "\n",
    "- **Time to First Token (TTFT)**: Latency before the first token is generated.\n",
    "- **Time Per Output Token (TPOT)**: Average latency per generated token.\n",
    "- **Token throughput**: Total tokens generated per second.\n",
    "- **GPU cache utilization**: Percentage of KV cache memory in use.\n",
    "- **Request latency**: End-to-end request duration.\n",
    "\n",
    "To enable engine-level metrics, set `log_engine_metrics: true` in your LLM configuration. This is enabled by default starting with Ray 2.51.0.\n",
    "\n",
    "The following example shows how to enable monitoring:\n",
    "\n",
    "```python\n",
    "llm_config = LLMConfig(\n",
    "    # ... other config ...\n",
    "    log_engine_metrics=True,  # Enable detailed metrics\n",
    ")\n",
    "```\n",
    "\n",
    "### Access the dashboard\n",
    "\n",
    "To view metrics in an Anyscale Service or Workspace:\n",
    "\n",
    "1. Navigate to your **Service** or **Workspace** page.\n",
    "2. Open the **Metrics** tab.\n",
    "3. Expand **View in Grafana** and select **Serve LLM Dashboard**.\n",
    "\n",
    "For a detailed explanation of each metric and how to interpret them for your workload, see [Understand LLM latency and throughput metrics](https://docs.anyscale.com/llm/serving/benchmarking/metrics).\n",
    "\n",
    "For comprehensive monitoring strategies and best practices, see the [Observability and monitoring guide](https://docs.ray.io/en/latest/serve/llm/user-guides/observability.html).\n",
    "\n",
    "---\n",
    "\n",
    "## Improve concurrency\n",
    "\n",
    "Ray Serve LLM uses [vLLM](https://docs.vllm.ai/en/latest/) as its backend engine, which logs the *maximum concurrency* it can support based on your configuration.  \n",
    "\n",
    "Example log for 8xL40S:\n",
    "```console\n",
    "INFO 08-19 20:57:37 [kv_cache_utils.py:837] Maximum concurrency for 32,768 tokens per request: 17.79x\n",
    "```\n",
    "\n",
    "The following are a few ways to improve concurrency depending on your model and hardware:  \n",
    "\n",
    "**Reduce `max_model_len`**  \n",
    "Lowering `max_model_len` reduces the memory needed for KV cache.\n",
    "\n",
    "**Example:** Running Llama-3.1-70&nbsp;B on 8xL40S:\n",
    "* `max_model_len = 32,768` → concurrency ≈ 18\n",
    "* `max_model_len = 16,384` → concurrency ≈ 36\n",
    "\n",
    "**Use Quantized models**  \n",
    "Quantizing your model (for example, to FP8) reduces the model's memory footprint, freeing up memory for more KV cache and enabling more concurrent requests.\n",
    "\n",
    "**Use pipeline parallelism**  \n",
    "If a single node isn't enough to handle your workload, consider distributing the model's layers across multiple nodes with `pipeline_parallel_size > 1`.\n",
    "\n",
    "**Upgrade to GPUs with more memory**  \n",
    "Some GPUs provide significantly more room for KV cache and allow for higher concurrency out of the box.\n",
    "\n",
    "**Scale with more replicas**  \n",
    "In addition to tuning per-replica concurrency, you can scale *horizontally* by increasing the number of replicas in your config.  \n",
    "Raising the replica count increases the total number of concurrent requests your service can handle, especially under sustained or bursty traffic.\n",
    "```yaml\n",
    "deployment_config:\n",
    "  autoscaling_config:\n",
    "    min_replicas: 1\n",
    "    max_replicas: 4\n",
    "```\n",
    "\n",
    "*For more details on tuning strategies, hardware guidance, and serving configurations, see [Choose a GPU for LLM serving](https://docs.anyscale.com/llm/serving/gpu-guidance) and [Tune parameters for LLMs on Anyscale services](https://docs.anyscale.com/llm/serving/parameter-tuning).*\n",
    "\n",
    "---\n",
    "\n",
    "## Troubleshooting\n",
    "\n",
    "If you encounter issues when deploying your LLM, such as out-of-memory errors, authentication problems, or slow performance, consult the [Troubleshooting Guide](https://docs.anyscale.com/llm/serving/troubleshooting) for solutions to common problems.\n",
    "\n",
    "---\n",
    "\n",
    "## Summary\n",
    "\n",
    "In this tutorial, you deployed a medium-sized LLM with Ray Serve LLM, from development to production. You learned how to configure and deploy your service, send requests, monitor performance metrics, and optimize concurrency.\n",
    "\n",
    "To learn more, take the [LLM Serving Foundations](https://courses.anyscale.com/courses/llm-serving-foundations) course or explore [LLM batch inference](https://docs.anyscale.com/llm/batch-inference) for offline workloads. For smaller models, see [Deploy a small-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/small-size-llm/README.html) or for larger models, see [Deploy a large-sized LLM](https://docs.ray.io/en/latest/serve/tutorials/deployment-serve-llm/large-size-llm/README.html)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "repo_ray_docs",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
